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Abstract 

An Artin-Schreier tower over the finite field F p is a tower of field extensions generated by 
polynomials of the form X p — X — a. Following Cantor and Couveignes, we give algorithms 
with quasi-linear time complexity for arithmetic operations in such towers. As an application, 
we present an implementation of Couveignes' algorithm for computing isogenies between elliptic 
curves using the p-torsion. 
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1. Introduction 

Definitions. If U is a field of characteristic p, polynomials of the form P = X p — X — a, 
with a G U, are called Artin-Schreier polynomials; a field extension U'/U is Artin-Schreier 
if it is of the form U' = V[X)/P, with P an Artin-Schreier polynomial. 

An Artin-Schreier tower of height k is a sequence of Artin-Schreier extensions Uj/Uj_i, 
for 1 ^ i ^ k; it is denoted by (Uo, . . . ,Ufe). In what follows, we only consider extensions 
of finite degree over F p . Thus, U,; is of degree p l over Uo, and of degree p l d over F p , with 
d=[U :¥ p }. 

The importance of this concept comes from the fact that all Galois extensions of 
degree p are Artin-Schreier. As such, they arise frequently, e.g., in number theory (for 
instance, when computing p fc -torsion groups of Abelian varieties over F p ). The need for 
fast arithmetics in these towers is motivated in particular by applications to isogeny 
computation and point-counting in cryptology, as in (Q). 

* This research was partly supported by the INRIA "Equipes associees" ECHECS team, NSERC and 
the Canada Research Chair program. 
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Our contribution. The purpose of this paper is to give fast algorithms for arithmetic 
operations in Artin-Schreier towers. Prior results for this task arc due to Cantor (0) and 
Couveignes (0). However, the algorithms of (@) need as a prerequisite a fast multiplica- 
tion algorithm in some towers of a special kind, called "Cantor towers" in (@). Such an 
algorithm is unfortunately not in the literature, making the results of (@) non practical. 

This paper fills the gap. Technically, our main algorithmic contribution is a fast change- 
of-basis algorithm; it makes it possible to obtain fast multiplication routines, and by 
extension completely explicit versions of all algorithms of (@). Along the way, we also 
extend constructions of Cantor to the case of a general finite base field Uo, where Can- 
tor had Uo = F p . We present our implementation, in a library called FAAST, based on 
Shoup's NTL ((291 ). As an application, we put to practice Couveignes' isogeny computation 
algorithm ^) (or, more precisely, its refined version presented in (fioh). 

Complexity notation. We count time complexity in number of operations in F p . Then, 
notation being as before, optimal algorithms in Ufc would have complexity 0(p k d); most 
of our results are (up to logarithmic factors) of the form 0(p k+a d 1+l3 ). for small constants 
a, /3 such as 0, 1, 2 or 3. 

Many algorithms below rely on fast multiplication; thus, we let M : N — > N be a multi- 
plication function, such that polynomials in F p \ X] of degree less than n can be multiplied 
in M(n) operations, under the conditions of (|13L Ch. 8.3). Typical orders of magnitude for 
M(n) are O(n los2 ^) for Karatsuba multiplication or 0(nlog(n) loglog(n)) for FFT mul- 
tiplication. Using fast multiplication, fast algorithms are available for Euclidean division 
or extended CCD (H Ch. 9 & 11). 

The cost of modular composition, that is, of computing F(G) mod H, for F,G, H e 
F p [X] of degrees at most n, will be written C(n). We refer to (l3l Ch. 12) for a presen- 
tation of known results in an algebraic computational model: the best known algorithms 
have subquadratic (but superlinear) cost in n. Note that in a boolean RAM model, the 
algorithm of (flit ) takes quasi-linear time. 

For several operations, different algorithms will be available, and their relative efficien- 
cies can depend on the values of p, d and k. In these situations, we always give details 
for the case where p is small, since cases such as p — 2 or p = 3 are especially useful in 
practice. Some of our algorithms could be slightly improved, but we usually prefer giving 
the simpler solutions. 

Previous work. As said above, this paper builds on former results of Cantor (0) and 
Couveignes (@;@); to our knowledge, prior to this paper, no previous work provided the 
missing ingredients to put Couveignes' algorithms to practice. Part of Cantor's results 



were independently discovered by Wang and Zhu (|33f ) and have been extended in another 
direction (fast polynomial multiplication over arbitrary finite fields) by von zur Gathcn 
and Gerhard (0) and Mateer Q. 

This paper is an expanded version of the conference paper (fill ). We provide a more 
thorough description of the properties of Cantor towers (Section 3), improvements to 
some algorithms (e.g. the Frobenius or pseudo-trace computations) and a more extensive 
experimental section. 
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Organization of the paper. Section 2 consists in preliminaries: trace computations, 
duality, basics on Artin-Schreier extensions. In Section 3, we define a specific Artin- 
Schreier tower, where arithmetic operations will be fast. Our key change-of-basis algo- 
rithm for this tower is in Section 4. In Sections 5 and 6, we revisit Couveignes' algorithm 
for isomorphism between Artin-Schreier towers (Q) in our context, which yields fast arith- 
metics for any Artin-Schreier tower. Finally, Section 7 presents our implementation of 
the FAAST library and gives experimental results obtained by applying our algorithms to 
Couveignes' isogeny algorithm (Q) for elliptic curves. 



2. Preliminaries 

As a general rule, variables and polynomials are in upper case; elements algebraic over 
F p (or some other field, that will be clear from the context) are in lower case. 

2.1. Element representation 

Let Qo be in F p [Xo] and let (Gi)o^j<fc be a sequence of polynomials over F p , with Gi 
in F p [Ao, . . . , Xi\. We say that the sequence (Gi)o^i<fc defines the tower (Uo, . . . , Vk) if 
for i ^ 0, Uj = F p [Ao, . . . , Xi]/Ki, where Ki is the ideal generated by 

Pi = Xf — Xi — Gi-i(Xo, ■ ■ ■ , Xi_x) 



Pi — X 1 — X\ — Gq(Xq) 

Qo(x ) 

in F p [A"o, . . . , Xi\, and if U, is a field. The residue class of Xi (resp. Gi) in U,-, and thus 
in Uj+i, . . . , is written Xi (resp. 7j), so that we have x\ — Xi = 7,-1. 

Finding a suitable F p -basis to represent elements of a tower (Uo, . . . , Ufc) is a crucial 
question. If d = deg(Qo), a natural basis of Uj is the multivariate basis B; = {xq° ■ ■ ■ x^} 
with eo < d and ^ ej < p for 1 ^ j ^ i. However, in this basis, we do not have 
very efficient arithmetic operations, starting from multiplication. Indeed, the natural 
approach to multiplication in B.; consists in a polynomial multiplication, followed by 
reduction modulo (Qo, Pi, ■ ■ ■ , Pi)', however, the initial product gives a polynomial of 
partial degrees (2d — 2, 2p — 2 , . . . , 2p — 2), so the number of monomials appearing is not 
linear in [U* : ¥ p ] = p l d. See 01) for details. 

As a workaround, we introduce the notion of a primitive tower, where for all i, x% 
generates Uj over ¥ p . In this case, we let Qi € F p LY] be its minimal polynomial, of 
degree p % d. In a primitive tower, unless otherwise stated, we represent the elements of U, 
on the Fp-basis C, = (l,Xi, . . . ,x\ d ~ 1 ). 

To stress the fact that v £ U; is represented on the basis Ci, we write v H Uj. In 
this basis, assuming Qi is known, additions and subtractions are done in time p % d, mul 



tiplications in time 0(M(p' l d)) (|13l . Ch. 9) and inversions in time 0(M(p l d) log(p t d)) ()13l . 
Ch. 11). 

Remark that having fast arithmetic operations in Uj enable us to write fast algorithms 
for polynomial arithmetic in Uj[Y], where Y is a new variable. Extending the previous 
notation, let us write A H U, [Y] to indicate that a polynomial A e Uj [Y] is written on the 
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basis (xfY^) 0< 

Q<p i d,o^/3 °f Uj[y]. Then, given A, B ~\ Uj[Y], both of degrees less than 
n, one can compute AB H U»[F] in time 0(M(p l dn)) using Kronecker's substitution (|16l . 
Lemma 2.2). 

One can extend the fast Euclidean division algorithm to this context, as Newton 



iteration reduces Euclidean division to polynomial multiplication. The analysis of (|13 , 
Ch. 9) implies that Euclidean division of a degree n polynomial A H U, [Y] by a monic 
degree m polynomial B H Ui[Y], with to n, can be done in time 0(M(p l dn)). 

Finally, fast GCD techniques carry over as well, as they are based on multiplica- 
tion and division. Using the analysis of (0, Ch. 11), we see that the extended GCD 
of two monic polynomials A, B H Ui[y] of degree at most n can be computed in time 
0(M(pMnlog(n))). 

2.2. Trace and pseudotrace 

We continue with a few useful facts on traces. Let U be a field and let U' = U[X]/Q 
be a separable field extension of U, with deg(Q) = n. For a £ U', the trace Tr(a) is the 
trace of the U- linear map M a of multiplication by a in U'. 

The trace is a U-linear form; in other words, Tr is in the dual space U'* of the U-vector 
space U'; we write it Try/ / v when the context requires it. In finite fields, we also have 
the following well-known properties: 

Ttf^/f, : a t-> J2"e=o a ^> ( Pl ) 
Tr F<jm „/ F? = Tr ¥qm/¥q oTr ¥qmn/¥qm . (P 2 ) 

Besides, if U'/U is an Artin-Schrcicr extension generated by a polynomial Q and x is 
a root of Q in U', then 

Tr u7u (^') = for j < p - 1] Tr^/u^ 1 ) = -1. (P 3 ) 

Following (Q), we also use a generalization of the trace. The ??th pseudotrace of order to 
is the F p m -linear operator 

T(„, m ) : a i y YJtZa ^ ; 
for to = 1, we call it the nth pseudotrace and write T„. 

In our context, for n = [Vi : Uj] = p l ~ 1 and to = [Uj : F p ] = p° d, T( n . m )(v) coincides 
with Tijj^n (v) for v in U,;; however T( n . m )(v) remains defined for v not in Ui, whereas 
Tr v . /V .(v) is not. 

2.3. Duality 

Finally, we discuss two useful topics related to duality, starting with the transposition 
of algorithms. 

Introduced by Kaltofcn and Shoup, the transposition principle relates the cost of 
computing an F^-linear map / : V — > W to that of computing the transposed map 
/* : W* — > V* . Explicitly, from an algorithm that performs an r x s matrix- vector 
product b i — y Mb, one can deduce the existence of an algorithm with the same complexity, 
up to 0(r + s), that performs the transposed product c <-> M t c; see H; Gil HI) . However, 
making the transposed algorithm explicit is not always straightforward; we will devote 
part of Section 4 to this issue. 
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We give here first consequences of this principle, after (3(1 3l|; [l|). Consider a degree 
n field extension U — > V , where U' is seen as an U-vector space. For w in U', recall that 
M w : U' — > U' is the multiplication map M w (v) = vw. Its dual M* : U'* -> U'* acts on 
I £ U'* by M*(£)(v) = i{M w {v)) = l{vw) for v in U'. Wc prefer to denote the linear 
form M*(£) by w ■ £, keeping in mind that (w ■ £)(v) = (.(vw). 

Suppose then that D is a U-basis of U', in which we can perform multiplication in 
time T. Then by the transposition principle, given w on D and £ on the dual basis D*, 
we can compute w ■ £ on the dual basis D* in time T + 0(n). This was discussed already 
in (pill HI), and we will get back to this in Section 4. 

Suppose finally that U' is separable over U and that b £ U' generates U' over U; wc 
will denote by Q £ V[X] the minimal polynomial of b. Given w in U', we want to find 
an expression w = A(b), for some A £ U[X]. Hereafter, for P £ V[X] of degree at most 
e, we write rev e (P) = X e P(l/X) £ V[X]. Then, recalling that n = [V : U], we define 
£ ~ w ■ Try/ m £ V* and 

M = ^£(b j )X j 1 N = Miev n (Q) mod X™. (1) 

j<n 

This construction solves our problem: Theorem 3.1 in (jH) shows that w = A(b), with 
A = rev„_i(A r )Q' _1 mod Q. We will hereafter denote by FindParameterization(6, w) a 
subroutine that computes this polynomial A; it follows closely a similar algorithm given 
m Since this is the case we will need later on, we give details for the case where Q 
is Artin-Schrcier (so n — p): then, Q' = — 1, so no work is needed to invert it modulo Q. 

In the following algorithm, wc suppose that U' is presented as U' = U[X]/P, where P 
is Artin-Schrcier. Wc let x be the residue class of X in U'. 



FindParameterization 

Input w £ V' written as wo + ■ ■ • + u' P -ia: p_1 , b £ U' written as 6o + ■ ■ ■ + b p -ix p 
Output A polynomial A of degree less than p such that w = ^4(6) 

(1) let I = w ■ Tr v , /V 

(2) let M = Y,j <p £{V) x:j 

(3) let N = Mrev p (Q) mod X p 

(4) return — rev p -i(N) 



Proposition 1. If Q is Artin-Schreier, the cost of FindParameterization is 0(p 2 ) opera- 
tions (+, x) in U. 

Proof. By P3 , the representation of Try/ m in U'* is simply (0, . . . , 0, — 1) . Then by the dis- 
cussion above, if T is the cost of multiplying two elements of U' in the basis (1, . . . 
step 1 costs T + 0(p); this stays in Q(p 2 ) by taking a naive multiplication. Step 2 fits 
into the same bound, by the proof of (|3(X Th. 4). Taking the rev's in steps 3 and 4 is 
just reading the polynomials from right to left, thus this costs no arithmetic operation. 
Finally, step 3 features a polynomial multiplication truncated to the order p, this costs 
0(p 2 ) operations by a naive algorithm. □ 

Note that this cost can be improved with respect to p, by using fast modular compo- 
sition as in (pM)! ): we do not give details, as this would not improve the overall complexity 
of the algorithms of the next sections. 
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3. A primitive tower 



Our first task in this section is to describe a specific Artin-Schreier tower where arith- 
metics will be fast; then, we explain how to construct this tower. 

3.1. Definition 

The following theorem extends results by Cantor (@, Th. f .2), who dealt with the case 
U = F p . 

Theorem 2. Let Uo = F p LY ]/Qo, with Q irreducible of degree d, let xo = X mod Qo 
and assume that Tr Uo / Fp (a!o) ^ 0. Let (Gj)o^i<fc be defined by 

!Gq = Xq 
G\= X\ if p = 2 and d is odd, 
Gi = X^ in any other case. 

Then, (Gi)o^i<fc defines a primitive tower (Uo, . . . , Ufc). 

As before, for i ^ f, let P, = Xf — Xj — Gi-i and for i ^ 0, let Ki be the 
ideal (Qo,Px,...,Pi) in ¥ p [Xo, . . . ,Xi]. Then the theorem says that for i 0, Uj = 
F p [Ao, . . . , Xi]/Ki is a field, and that Xj = Aj mod Ki generates it over F p . We prove it 
as a consequence of a more general statement. 

Lemma 3. Let U be the finite field with p n elements and U'/U an extension field with 
[V : U] = p\ Let aeU'be such that 

Tr uvu (a) =f3jtQ, (2) 

then W p [/3] C ¥ p [a] and p l divides [F p [a] : F p [/?]]. 

Proof. Equation (2) can be written as /3 = ~J2jCt pJ ", thus F p [/3] C F p [a]. The rest of 
the proof follows by induction on i. If [U' : U] = f, then a = (3 and there is nothing 
to prove. If i 1, let U" be the intermediate extension such that [V : V"] = p and let 
a' = Truj/ fu" ( a ) > then, by P2, Tru"/u( Q ! / ) = /? an d by induction hypothesis divides 
[F p [a>] : F p [/3]]. ' 

Now, suppose that p does not divide [F p [a] : F p [a']]. Since F p [a'] C U", this implies 
that p does not divide [U"[a] : U"]; but a G U' and [U' : U"] = p by construction, so 
necessarily [U"[a] : U"] = 1 and a G U". This implies Trp/ (a) = pa = and, by P2, 
[3 = 0. Thus, we have a contradiction and p must divide [F p [a] : F p [a']]. The claim 
follows. □ 

Corollary 4. With the same notation as above, if Try/ /u(a) generates U over F p , then 
¥ p [a] = U'. 

Hereafter, recall that we write 7^ = Gi mod A 2 ;. We prove that the %'s meet the 
conditions of the corollary. 

Lemma 5. If p ^ 2, for i ^ 0, Uj is a field and, for i ^ 1, Tr u ./u._ 1 (7j) = — 7i_i. 
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Proof. Induction on i: for i = 0, this is true by hypothesis. For i ^ 1, by induction 
hypothesis Uo, ■ • ■ , U,_i are fields; we then set i' = i — 1 and prove by nested induction 
that Tr v / ¥p (ji>) ^ under the hypothesis that Uo,...,Uj' are fields. This, by (|24 , 
Th. 2.25), implies that Xf — Xi — 7j_i is irreducible in Uj_i[X; + i] and U; is a field. 

For i' = 0, Tr Uo / Fp (7o) = Tr Vo / ¥p (xq) is non-zero and we are done. For i' ^ 1, we 
know that 7,' = x^ p_1 = a^a;?, - , which rewrites 

(x v + 7 J /_i)xfr 1 = x\, + 7 i /_ixfr 1 = 7i'_i + Xi> + 7 i /_ix?r 1 . 

ByP 3 , weget Tr u ., /U .,_ i (7 i /) = -7^-1 and by P 2 , we deduce the equality Tr u ., /Fp (7 i /) = 
— Tru., ^^ (7j'_i). The induction assumption implies that this is non-zero, and the claim 
follows. □ 

Lemma 6. If p = 2, for i ^ 0, Uj is a field, for i 2, Tr^ (7,) = 1 + 7,-1 and 

f 1 4-<v„ if d , 

Tr Ul/Uo (7i) = 



70 if d even, 
if odd. 



Proof. The proof closely follows the previous one. For i' = 0, Ti Va / Vp (70) = Tru / r!j (:ro) 
is non-zero. For i' = 1 and d odd, Tr Ul/Uo (71) = Tr Dl/Uo (an) = 1 by P 3 , and Tr Uo/Fp (l) = 
d mod 2^0. For all the other cases ji> = xf,Xi' = 7f— i+(l+7i'— i)a;i/ , thus Tr-g., /u.,_, (7i') 
l+7i'_i by P3 and Ttu (1) = 0. In any case, using the induction hypothesis and P2, 
we conclude Try ™> (7j< ) = 1 and this concludes the proof. □ 

Proof of Theorem 2. If p ^ 2, by Lemma 5 and P2, Tru ; /u (7j) = (— l) l 7o, thus Uj = 
F p [7j] by Corollary 4 and the fact that 70 = xq generates Uo over F p . 

If p = 2, we first prove that Ui = F p [7i]. If d is odd, 71+71 = x implies Uo C F p [7i], 
but 71 Uo, thus necessarily Ui = F p [7i]. If d is even, Tr Dl / Do (71) = 1 + 70 clearly 
generates Uo over ¥ p , thus Ui = F p [7i] by Corollary 4. Now we proceed like in the p ^ 2 
case by observing that Tr v ./ Vl (7,) = 1 + 71 generates Ui over F p . 

Now, for any p, the theorem follows since clearly F p [7j] C F p [xj]. □ 

Remark that the choice of the tower of Theorem 2 is in some sense optimal between 
the choices given by Corollary 4. In fact, each of the Gj's is the "simplest" polynomial 
in F p [Xj] such that Tru./ Fp (7j) 7^ 0, in terms of lowest degree and least number of 
monomials. 

We furthermore remark that the construction we made in this section gives us a family 



of normal elements for free. In fact, recall the following proposition from i 171 Section 5) 



Proposition 7. Let U'/U be an extension of finite fields with [U' : U] = kp l where k 
is prime to p and let U" be the intermediate field of degree k over U. Then x € U is 
normal over U if and only if Tru'/U"^) is normal over U. In particular, if [U' : U] = p\ 
then x G U' is normal over U if and only if Try/ /u(x) ^= 0. 

Then we easily deduce the following corollary. 

Corollary 8. Let (Uo,...,Ufe) be an Artin-Schreier tower defined by some (Gj)o^j<fc. 
Then, every 7^ is normal over Uo; furthermore 7, is normal over F p if and only if 
Tr Ui/o (7i) is normal over F p . 
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In the construction of Theorem 2, if we furthermore suppose that 70 is normal over 
F p , using Lemma 5 we easily see that the conditions of the corollary are met for p ^= 2. 
For p = 2, this is the case only if [Uo : F p ] is even (we omit the proofs that if 70 is normal 
then so are —70 and 1 + 70). 

Remark. Observe however that this does not imply the normality of the Xi's. In fact, 
they can never be normal because r Tru i nj i _ 1 (xi) = by P3. Granted that 70 is normal 
over F p , it would be interesting to have an efficient algorithm to switch representations 
from the univariate F p -basis in Xi to the F p -normal basis generated by 

3.2. Building the tower 

This subsection introduces the basic algorithms required to build the tower, that is, 
compute the required minimal polynomials Q;. 

Composition. We give first an algorithm for polynomial composition, to be used in the 
construction of the tower defined before. Given P and R in F P [X], we want to compute 
P(R). For the cost analysis, it will be useful later on to consider both the degree k and 
the number of terms t of R. 

Compose is a recursive process that cuts P into c + 1 "slices" of degree less than p n , 
recursively composes them with R, and concludes using Horner's scheme and the linearity 
of the p-power. At the leaves of the recursion tree, we use the following naive algorithm. 



NaiveCompose 

Input P,Re ¥ P [X]. 
Output P(R). 

(1) write P = T,t=o P) Pi xi - with Pi G f p 

(2) let S = 0, p = 1 

(3) for i 6 [0, . . . ,deg(P)], let S = S + pip and p = pR 

(4) return 5" 



Lemma 9. NaiveCompose has cost (9(deg(P) 2 fc^). 

Proof. At step i, p and S have degree at most ik. Computing the sum S +PiP takes time 
0(ik) and computing the product pR takes time 0(ik£), since R has I terms. The total 
cost of step i is thus 0{ikt), whence a total cost of 0(dcg(P) 2 kl). □ 
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Compose 



Input P,P G ¥ P [X]. 
Output P(R). 

(1) let n = |log p (deg(P))j and c = deg(P) div p n 

(2) If n — 0, return NaiveCompose(P, R) 

(3) write P = E - =0 ^*' P " . with ^ e deg P < p 1 

(4) for i € [0, . . . , c], let Qi = Compose(Pi, P) 

(5) let Q = 

(6) for i € [c, . . . , 0], let Q = QR(X P ") + Q d 

(7) return Q 



Theorem 10. If R has degree k and £ non-zero coefficients and if dcg(P) = s, then 
Compose(P, R) outputs P(R) in time 0(ps log p (s)k£). 

Proof. Correctness is clear, since R p = R(X P ). To analyze the cost, we let K(c, n) be 
the cost of Compose when deg(P) ^ (c + l)p n , with c < p. Then K(c, 0) 6 0{c 2 k£). For 
?i > 0, at each pass in the loop at step 6, deg(Q) < cp n k, so that the multiplication 
(using the naive algorithm) and addition take time 0(cp n k£). Thus the time spent in the 
loop is 0(c 2 p n k£), and the running time satisfies 

K(c,n) < (c + l)K(p - l,n - 1) + 0(c 2 p n k£). 

Let then K'(n) = K(p — l,n), so that we have 

K'(0) e 0( P 2 k£), K'(n) < pK'(n - 1) + 0(p n+2 k£). 

We deduce that K'(n) £ 0(p" +2 nfc£), and finally K(c,n) G 0(cp™ +1 nfc£ + c 2 p n k£). The 
values c, n computed at step 1 of the top-level call to Compose satisfy cp" ^ s and 
n ^ log p (s); this gives our conclusion. □ 



A binary divide-and-conquer algorithm (|13l Ex. 9.20) has cost 0(M(sfc) log(s)). Our 



algorithm has a slightly better dependency on s, but adds a polynomial cost in p and 
£. However, we have in mind cases with p small and I = 2, where the latter solution is 
advantageous. 

Computing the minimal polynomials. Theorem 2 shows that we have defined a 
primitive tower. To be able to work with it, we explain now how to compute the minimal 
polynomial Qi of x% over F p . This is done by extending Cantor's construction (0), which 
had Uo = F p . 

For i — 0, we are given Qq G F p [Xo] such that Uo = ¥ p [Xq]/Qo(Xq), so there is 
nothing to do; we assume that Tr Uo / Fp (ceo) 7^ to meet the hypotheses of Theorem 2. 
Remark that if this trace was zero, assuming gcd(d,p) = 1, we could replace Qq by 
Qo(Xq — 1); this is done by taking R = Xq — 1 in algorithm Compose, so by Theorem 10 
the cost is 0(pd\og p (d)). 

For i = 1, we know that x\ — x\ = xq, so x\ is a root of Qo(Xf-Xi). Since Qo(Xf-Xi) 
is monic of degree pd, we deduce that Qi = Qo(Xf—Xi). To compute it, we use algorithm 
Compose with arguments Qq and R = X p — X\\ the cost is 0(p 2 d\og p (d)) by Theorem 10. 
The same arguments hold for i = 2 when p = 2 and d is odd. 
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To deal with other indexes i, we follow Cantor's construction. Let $ £ F P [X] be the 
reduction modulo p of the (2p — l)th cyclotomic polynomial. Cantor implicitly works 
modulo an irreducible factor of $. The following shows that we can avoid factorization, 
by working modulo $. 

Lemma 11. Let A = W p [X]/$ and let x = X mod For Q 6 F p [Y~], define Q* = 
n?=o 2 Q{x l Y). Then Q* is in ¥ P [Y] and there exists q* € F p [Y] such that = 

Proof. Let fi, . . . , F e be the irreducible factors of $ and let / be their common degree. 
To prove that Q* is in F p [Y], we prove that for j ^ e, Q* = Q* mod Fj is in F p [Y] and 
independent from j; the claim follows by Chinese Remaindering. 

For j ^ e, let a,j be a root of Fj in the algebraic closure of F p , so that Q* = 
n,£o Q( a% jY)- Since gcd(p^,2p — 1) = 1, Q* is invariant under Gal(F p //F p ), and thus 
in F p [y]. Besides, for j, j' ^ e, a,j = a*,, for some k coprimc to 2p — 1, so that Q* = Q*,, 
as needed. 

To conclude, note that for j ^ e, Q*(ajY) = Q*(Y), so that all coefficients of de- 
gree not a multiple of 2p — 1 are zero. Thus, has the form q*{Y 2p ~ 1 )] by Chinese 
Remaindering, this proves the existence of the polynomial q* . □ 

We conclude as in (0): supposing that we know the minimal polynomial Qi of Xi over 
F p , we compute Qj+i as follows. Since x% is a root of Qi, it is a root of Q*, so 7^ = £j P_1 
is a root of and Xi+i is a root of q*(Y p — Y). Since the latter polynomial is monic of 
degree p l+1 d, it is the minimal polynomial of Xi+i over ¥ p . 

Theorem 12. Given Qi, one can compute Qi+i in time (^(p 1 " 1 " 2 ^^^^'^)^-!^!^" 1 " 2 ^) log(p)). 

Proof. Let A = F p [X]/$. The algorithm of (0) computes <I> in time 0(p 2 ); then, polyno- 
mial multiplications in degree s in -A[Y"] can be done in time 0(M(sp)) by Kronecker sub- 
stitution. The overall cost of computing Q* is 0(M(p l+2 d) logp) using (fl3l . Algo. 10.3). To 
get Qj+i we use algorithm Compose with R = Y p — Y, which costs 0(p l+2 d\og p (p l d)). □ 

The former cost is linear in p l up to logarithmic factors, for an input of size p l d 
and an output of size p l+l d. 

Some further operations will be performed when we construct the tower: we will pre- 
computc quantities that will be of use in the algorithms of the next sections. Details are 
given in the next sections, when needed. 

4. Level embedding 

We discuss here change-of-basis algorithms for the tower (Uo, ■ • • , Vk) of the previous 
section; these algorithms are needed for most further operations. We detail the main case 
where Pi = Xf - Xi - Xf?' 1 ; the case P 1 =Xf-X 1 - X (and P 2 = X% + X 2 + Xi for 
p = 2 and d odd) is easier. 
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By Theorem 2, Ui equals ¥ p [Xi-i,Xi]/I, where the ideal I admits the following 
Grobner bases, for respectively the lexicographic orders Xi > and > X^, 



X 



x l -x^~ 1 
Qi-i(^Q-i) 



and 



Qi(Xi), 



with i?^ in F p [Xj]. Since deg(Qi_i) = p l 1 d and deg(Qi) = P l d, we associate the following 
Fp-bases of Ui to each system: 



C, 



(1, £Ci 



.., or 



. . . , a* 



<Kj<I>> 



(3) 



We describe an algorithm called Push-down which takes v written on the basis Ci and 
returns its coordinates on the basis D^; we also describe the inverse operation, called 
Lift-up. In other words, Push-down inputs v H Ui and outputs the representation of v as 

v = vq + vixi + • • • + Wp-ixf -1 , with all Vj H Uj_i (4) 

and Lift-up does the opposite. 

Hereafter, we let L : N — {0} — s- N be such that both Push-down and Lift-up can be 
performed in time L(i); to simplify some expressions appearing later on, we add the mild 
constraints thatpL(z) ^ L(i + l) and p M(p z d) € 0(L(i)). To reflect the implementation's 
behavior, we also allow precomputations. These precomputations are performed when we 
build the tower; further details are at the end of this section. 

Theorem 13. One can take L(i) in 0(p i+1 dlog p (pM) 2 + pM(p i d)). 

Remark that the input and output have size p l d; using fast multiplication, the cost is 
linear in p t+1 d, up to logarithmic factors. The rest of this section is devoted to proving this 
theorem. Push-down is a divide-and-conquer process, adapted to the shape of our tower; 
Lift-up uses classical ideas of trace computations (as in the algorithm FindParameterization 
of Section 2.3); the values we need will be obtained using the transposed version of Push- 
down. 

As said before, the algorithms of this section (and of the following ones) use precom- 
puted quantities. To keep the pseudo-code simple, we do not explicitly list them in the 
inputs of the algorithms; we show, later, that the precomputation is fast too. 

4-1. Modular multiplication 

We first discuss a routine for multiplication by Xf in ¥ p [Y, Xi]/ (Xf — Xi — Y), and 
its transpose. We start by remarking that Xf = Xi + R n mod Xf — Xi —Y, with 

Ru = e;=oYp 3 . (5) 

Then, precisely, for k in N, we are interested in the operation MulModfc.„ : A i-> (Xi + 
R n )A mod Xf -Xi-Y, with A e ¥ P [Y, X,], deg(A, Y) < k and deg(^, Xi) < p. 

Since R n is sparse, it is advantageous to use the naive algorithm; besides, to make 
transposition easy, we explicitly give the matrix of MulMod/;,,!. Let mo be the (k+p n ~ 1 )xk 
matrix having l's on the diagonal only, and for £ p n_1 , let mi be the matrix obtained 
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from mo by shifting the diagonal down by t places. Let finally m' be the sum £™_ m p j . 
Then one verifies that the matrix of MulMod^n is 

ml mi 
mo m' mo 
mo m! 

mo m' 

with columns indexed by (Xf , . . . , Y k ~ 1 X{)j <p and rows by (Xf , . . . , Y k+P " 1 ~ 1 Xf) J<p . 
Since this matrix has 0(pnk) non-zero entries, we can compute both MulModfc„ and its 
dual MulMod^ n in time 0(pnk). 

4-2. Push- down 

The input of Push-down is v H Uj, that is, given on the basis C,;; we see it as a 
polynomial V <E ¥ p [Xi] of degree less than p z d. The output is the normal form of V 
modulo Xf — Xi — Xf^ 1 and Qj_i(Xj_i). We first use a divide-and-conquer subroutine 
to reduce V modulo Xf — Xi — X^~ • then, the result is reduced modulo Q,_i(JQ_i) 
coefficient- wise. 

To reduce V modulo Xf — Xi — Xf^ 1 , we first compute W — V mod Xf — Xi — Y, 
then we replace Y by Xf^ 1 in W. Because our algorithm will be recursive, we let deg(V) 
be arbitrary; then, we have the following estimate for W. 

Lemma 14. We have deg(W, Y) < deg(V)/p. 

Proof. Consider the matrix M of multiplication by Xf modulo Xf — Xi~Y; it has entries 
in F p [y]. Due to the sparseness of the modulus, one sees that M has degree at most 1, and 
so M k has coefficients of degree at most k. Thus, the remainders of Xf k , . . . ,Xf fc+p ^ 1 
modulo Xf — Xi —Y have degree at most k in Y. □ 

We compute W by a recursive subroutine Push-down-rec, similar to Compose. As 

before, we let c, n be such that 1 ^ c < p and deg(V) < (c + l)p n , so that we have 

V = V + VxXf + ■■■ + V c X c f , 

with all Vj in F p [X,] of degree less thanp". First, we recursively reduce Vq, ■ ■ ■ ,V C modulo 
Xf — Xi — Y, to obtain bivariatc polynomials Wq, . . . , W c . Let R n be the polynomial 
defined in Equation (5). Then, we get W by computing T^_ Q Wj{Xi + R n y modulo 
Xf — Xi —Y, using Horner's scheme as in Compose. Multiplications by Xi + R n modulo 
Xf -Xi-Y are done using MulMod. 
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Push-down-rec 



Input V € ¥ p [Xi] and c,n£l 
Output W e ¥ p [Y,Xi]. 

(1) if 7i = return V 

(2) write V = ViX?" , with V, £ ¥ p [Xi], degVj < p n 

(3) for j £ [0, . . . , c], let Wj = Push-down-rec(V, , p - l,n - 1) 

(4) ^ = 

(5) for j e [c,...,0], let W = MulMod (c+1)p „-i in (W) + W^ 

(6) return W 



Push-down 

Input v H Uj. 

Output ?j written as do H h v v ^\x^ ~ with H Ui_i. 

(1) let V be the canonical preimage of v in F p [Xi] 

(2) let n = Llog p (p*d - 1)J and c = (p*d - 1) div p n 

(3) let VP = Push-down-rec(V, c, n) 

(4) let Z = Evaluate(W, [X^ 1 ,Xi\) 

(5) let Z = Z mod Qi_i 

(6) return the residue class of Z mod (Xf — Xi — X?^ 1 , Qi-i) 



Proposition 15. Algorithm Push-down is correct and takes time 0(p l+1 d\og p (p l d) 2 + 
pM(p ! d)). 

Proof. Correctness is straightforward; note that at step 5 of Push-down-rec, dcg(W, Y) < 
(c + l)p n_1 , so our call to MulMod( c+1 ) p ™-i „ is justified. By the claim of Subsection 4.1 
on the cost of MulMod, the total time spent in that loop is 0{nc 2 p n ). As in Theorem 10, 
we deduce that the time spent in Push-down-rec is 0(n 2 c 2 p n ). 

In Push-down, we have cp n < p l d and n < log p (p t d), so the previous cost is seen to be 
0{p l+1 d\og p {p l d) 2 ). Reducing one coefficient of Z modulo Q,_i takes time 0(M(p l d)), 
so step 5 has cost 0(pM(p l d)). Step 6 is free, since at this stage Z is already reduced. □ 

4-3. Transposed push- down 

Before giving the details for Lift-up, we discuss here the transpose of Push-down. Push- 
down is the Fp-linear change-of-basis from the basis Cj to D^, so its transpose takes 
an Fp-linear form I £ U* given by its values on D,, and outputs its values on Cj. The 
input is the (finite) generating series L = T, a<p i-i d b<p £(xf_ 1 x^)X^_ 1 Xf; the output is 
M = X a<f i d £(xt)X?. 

As in (|l|) , the transposed algorithm is obtained by reversing the initial algorithm step 
by step, and replacing subroutines by their transposes. The overall cost remains the same; 
we review here the main transformations. 

In Push-down-rec, the initial loop at step 5 is a Horner scheme; the transposed loop is 
run backward, and its core becomes Lj = L mod y"^ 1 and L = MulMod* c+1 -,p„-i „(£); 
a small simplification yields the pseudo-code we give. In Push-down, after calling Push- 
down-rec, we evaluate W at [X^ ,Xi\: the transposed operation Evaluate* maps the 
series £ Q) b ia^Xf^X^ to £ a .b^(2 P -i)a,&^ a A 4 fc . Then, originally, we perform a Euclidean 
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division by on Z. The transposed algorithm mod* is in QjJ, Sect. 5.2): the transposed 
Euclidean division amounts to compute the values of a sequence linearly generated by 
the polynomial Qi-i from its first p l ~ 1 d values. 



Push-down-rec* 

Input L e ¥ p [Y,Xi] and c,n £ N. 
Output M e W p [Xi] 

(1) If n = return L 

(2) for j e [c,...,0], 

• let Lj = L mod V" -1 

• let Mj = Push-down-rec* (Lj, p — 1, n — 1) 
. letL = MulMod* c+1)p „_ 1-n (L) 

(3) return E^o^f" 



Push-down* 



Input L e W p [Xi-i,Xi] 
Output A/ G F p [Xj] 

(1) let n = Llogp(p' d - !)J and c = (p id ~ 1) div P' 

(2) let P = mod*(L,Qi_i) 

(3) let M = Evaluate* (P, [X^^ 1 , X t ]) 

(4) return Push-down-rec* (M, c, n) 



Lift-up 

Let i> be given on the basis D; and let W be its canonical prcimage in ¥ p , Xj] . The 
lift-up algorithm finds V in F p [X] such that = V mod (Xf - X,- - Xf^ 1 , Q^) and 
outputs the residue class of V modulo Qi. Hereafter, we assume that both Q^ 1 mod Qi 
and the values of the trace Tv^./f on the basis arc known. The latter will be given 
under the form of the (finite) generating series 

&i = Ha<p i - 1 d,b<p^Vi/¥ p { x i-lX b i ) X i-l X h 

see the discussion below. 

Then, as in Subsection 2.3, we use trace formulas to write v as a polynomial in xf we 
see Uj as a separable extension over ¥ p and we look for a parameterization v = A(xi). 
To do this, we compute the values of L = v ■ Try./f on the basis via transposed 
multiplication (see Subsection 2.3) and rewrite equations (1) as 

M=Y J L(4)*i, ^V = Mrev ptd (g i )modXf d . (6) 

j<p i d 

To compute the values of M we could use |30l Th. 4) as we did in step 2 of FindParame- 
terization; it is however more efficient to use Push-down* as it was shown in the previous 
subsection. The rest of the computation goes as in steps 3 and 4 of FindParametrization. 
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Lift-up 



Input v written as vo + ■■■ + Vp-iXf 1 with vj H Vi-%. 
Output uHDi. 

(1) let W be the canonical preimage of v in F p [Xi_i, Xi] 

(2) let L = TransposedMuKW, Si) 

(3) let M = Push-down* (L) 

(4) let N = Mrev pid (Qi) mod xf d 

(5) let V = rev pld _ 1 (iV)Qr 1 mod Q { 

(6) return the residue class of V modulo Q t 



Proposition 16. Algorithm Lift-up is correct and takes time 0(p l+1 d\og p (p l d) 2 +p M(p l d)). 
Proof. Correctness is clear by the discussion above. TransposedMul implements the trans- 



posed multiplication; an algorithm of cost 0(M(p l d)) for this is in (|26l . Coro. 2). The last 
subsection showed that step 3 has the same cost as Push-down. Then, the costs of steps 4 
and 5 are 0(M(p l d)) and step 6 is free since V is reduced. □ 

Propositions 15 and 16 prove Theorem 13. The precomputations, that are done at the 
construction of U,, are as follows. First, we need the values of the trace on the basis D^; 
they are obtained in time 0(M(p l d)) by fsiil . Prop. 8). Then, we need Q'~ X mod Qi] this 
takes time 0(M(p l d) log(p l d)) by fast extended GCD computation. These precomputa- 
tions save logarithmic factors at best, but are useful in practice. 



5. Frobenius and pseudotrace 

In this section, we describe algorithms computing Frobenius and pseudotrace opera- 
tors, specific to the tower of Section 3; they are the keys to the algorithms of the next 
section. 

The algorithms in this section and the next one closely follow Couveignes' (0). How- 
ever, the latter assumed the existence of a quasi-linear time algorithm for multiplication 
in some specific towers in the multivariate basis of Subsection 2.1. To our knowledge, 
no such algorithm exists. We use here the univariate basis C, introduced previously, which 
makes multiplication straightforward. However, several push-down and lift-up operations 
are now required to accommodate the recursive nature of the algorithm. 

Our main purpose here is to compute the pseudotrace T p j d : x <-> X^=o _1 x%> ■ First, 
however, we describe how to compute values of the iterated Frobenius operator x i-> x p 
by a recursive descent in the tower. 

We focus on computing the iterated Frobenius for n < d or n = pPd. In both cases, 
similarly to (5), we have: 

x'l =Xi+Pi-i in , with Pi-i,n = T n (7,_i). (7) 

Assuming ft_i, n is known, the recursive step of the Frobenius algorithm follows: starting 
from v H Uj, we first write v = vq + ■ ■ ■ + i> p _ia;^ -1 , with Vh H U.;_i; by (7) and the 
linearity of the Frobenius, we deduce that 

vP " = Efe=0 V h ( X i + Pi-hnf ■ 
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Then, we compute all v v h recursively; the final sum is computed using Horner's scheme. 
Remark that this variant is not limited to the case where n < d or of the form p° d: an 
arbitrary n would do as well. However, we impose this limitation since these are the only 
values we need to compute T p jd- 

In the case n = p 3 d, any v G U/ is left invariant by this Frobcnius map, thus we stop 
the recursion when i = j, as there is nothing left to do. In the case n < d, we stop the 
recursion when i = and apply (flil . Algorithm 5.2). We summarize the two variants in 
one unique algorithm IterFrobenius. 



IterFrobenius 



Input v, i, n with wHUi and n < d or n = p 3 d. 
Output v p " H Uj. 

(1) if n = p J d and i $J j, return v 

(2) if i = 0, return v p " 

(3) let vq + ViXi + ■ ■ ■ + v v _\x v ~ = Push-down(u) 

(4) for h 6 [0, . . . ,p — 1], let th = lterFrobenius(uh, i — 1, n) 

(5) let F = 

(6) for fee [p-l,...,0], let F = t ft + (xi + A-i,n)-F 

(7) return Lift-up(F) 



As mentioned above, the algorithm requires the values for i' < i: we suppose that 
they are precomputcd (the discussion of how we precompute them follows). To analyze 
costs, we use the function L of Section 4. 

Theorem 17. On input v H Uj and n = p 3 d, algorithm IterFrobenius correctly computes 
v p and takes time 0((i — j)L(i)). 

Proof. Correctness is clear. We note for the complexity on inputs as in the state- 

ment; then F(0, j) = ■ • • = = because step 1 comes at no cost. For i > j, each 

pass through step 6 involves a multiplication by xi + /3j_i.„, of cost of 0(pM(p l ~ 1 d)), 
assuming ft_i )n H Uj_i is known. Altogether, we deduce the recurrence relation 

F(«, j) <pF(< - 1, j) +2L(<) + OtfMip*- 1 *)), 

so F(i, j) ^ p F(i — 1, j) + 0(L(i)), by assumptions on M and L. The conclusion follows, 
again by assumptions on L. □ 

Theorem 18. On input v H Uj and n < d, algorithm IterFrobenius correctly computes 
v p " and takes time 0(p l C(d) log(n) + ih(i)). 

Proof. The analysis is identical to the previous one, except that step 2 is now executed 
instead of step 1 and this costs 0(C(d) log(n)) by (|14l . Lemma 5.3). The conclusion follows 
by observing that step 2 is repeated p % times. □ 

Next, we compute pseudotraces. We use the following relations, whose verification is 
straightforward: 

rn— 1 

T n+m (u) = T n (v) +T m (v) p , T nm (v) = ^2 T„(«) p . 

h=a 
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We give two divide- and- conquer algorithms that do a slightly different divide step; each of 
them is based on one of the previous formulas. The first one, LittlePseudotrace, is meant to 
compute TV It follows a binary divide- and-conquer scheme similar to (|14l . Algorithm 5.2). 
The second one, Pseu dot race, computes T p id for j > 0. It uses the previous formula with 
n = pi~ 1 d and m = p, computing Frobenius-es for such n; when j = 0, it invokes the 
first algorithm. 



LittlePseudotrace 

Input v, i, n with v H Oj and < n ^ d. 
Output T n (v) H Uj. 

(1) if n = 1 return v 

(2) let m = [n/2\ 

(3) let 4 = LittlePseudotrace(L\ i, m) 

(4) let £ = t+ lterFrobenius(£, i, m) 

(5) if n is odd, let t — t+ lterFrobenius(u, i, n) 

(6) return t 



Pseudotrace 

Input v, i, j with v H Of. 
Output T p3d (v) HU„ 

(1) if j = return LittlePseudotrace(i\ d) 

(2) £ =Pseudotrace(u, i, j — 1) 

(3) for h G [1, . . . ,p — 1], let th = lterFrobenius(£h_i, i,j — 1) 

(4) return £ + ti H h 



Theorem 19. Algorithm LittlePseudotrace is correct and takes time 0(p l C(d) log 2 (n) + 
iL(i)log(n)). 

Proof. Correctness is clear. For the cost analysis, we write PT(i, n) for the cost on input 
i and n, so PT(i, 1) = O(l). For n > 1, step 3 costs PT(i, |/t/2j), steps 4 and 5 cost 
both 0(p l C(d)log 2 {n) + ih(i)) by Theorem 18. This gives PT(i,n) = PT(i, L"/2J) + 
0(p*C(d)log 2 (n) + iL(i)), and thus PT(i,n) G 0{p l C(d) log 2 (n) + iL(i) logn). □ 

Theorem 20. Algorithm Pseudotrace is correct and takes time PT(i) = 0((pi+\og(d))iL(i)+ 
p i C(d)log 2 (d)) for j ^ i. 

Proof. Correctness is clear. For the cost analysis, we write PT(i, j) for the cost on input 
i and j, so theorem 19 gives PT(i, 0) = 0(p l Q(d) \og 2 (d) + iL(i) log(d)). For j > 0, step 2 
costs PT(i,j — 1), step 3 costs 0(piL(i)) by Theorem 17 and step 4 costs 0(p l+1 d). This 
gives PT(i,j) = PT(i, j - 1) + 0(piL(i)), and thus PT(i,j) e 0(pijL(i) + PT(i,0)). □ 

The cost is thus 0(p l+2 d + p l C(d)), up to logarithmic factors, for an input and output 
size of p l d: this time, due to modular compositions in Un, the cost is not linear in d. 

Finally, let us discuss precomputations. On input v, i, d, the algorithm LittlePseu- 
dotrace makes less than 21ogd calls to lterFrobenius(a;,i,n) for some value i £ Dj and 
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for n £ N where the set N only depends on d. When we construct Uj+i, we compute 
(only) all j3i i7l = T n (7j) H U,, for increasing n £ N, using the LittlePseudotrace algo- 
rithm. The inner calls to IterFrobenius only use pseudotraces that are already known. 
Besides, a single call to LittlePseudotrace^, i, d) actually computes all T n (7<) in time 
0(p l C(d) log 2 d+iL(i) logd). Same goes for the precomputation of all /3j lP i<j = T p jd(ji) ^ 
Ui, for j ^ z, using the Pseudotrace algorithm: this costs PT(i). Observe that in total 
we only store 0(k 2 + klogd) elements of the tower, thus the space requirements are 
quasi-linear. 



Remark. A dynamic programming version of LittlePseudotrace as in (|14j . Algorithm 5.2) 
would only precompute fc^e for 2 e < d, thus reducing the storage from 21ogd to [log rfj 
elements. This would also allow to compute T n for any n < d without needing any further 
precomputation. Using this algorithm and a decomposition of n > d as n = r + Y) j Cjp>d 
with r < d and Cj < p, one could also compute T n and x p at essentially the same cost. 
We omit these improvements since they are not essential to the next Section. 



6. Arbitrary towers 

Finally, we bring our previous algorithms to an arbitrary tower, using Couveignes' 
isomorphism algorithm (|9|). As in the previous section, we adapt this algorithm to our 
context, by adding suitable push-down and lift-up operations. 

Let Qo be irreducible of degree d in F p [Xo], such that Ttu /f (xo) ^ 0, with as before 
Uo = F p [A ]/Qo- We let (Gi)o^j<fe and (Uo, • • • , Ufc) be as in Section 3. 

We also consider another sequence (G^)o^i<fc, that defines another tower (U , . . . ,W k ). 
Since (U , . . . ,V k ) is not necessarily primitive, we fall back to the multivariate basis of 
Subsection 2.1: we write elements of on the basis B' ; = {x' Q eo ■ ■ ■ x'^'}, with xq = x' Qi 
^ eo < d and ^ ej < p for 1 ^ j ^ i. 

To compute in U^, we will use an isomorphism — > U^. Such an isomorphism is 
determined by the images Sj = (sq, . . . , Sj) of (xq, . . . , x'), with Sj H U, (we always take 
so = xo). This isomorphism, denoted by a Si , takes as input v written on the basis B£ 
and outputs cr Bi (v) H Uj. 

To analyze costs, we use the functions L and PT introduced in the previous sections. 



We also let 2 ^ ui ^ 3 be a feasible exponent for linear algebra over F p (|13l . Ch. 12) 



Theorem 21. Given Qo and (Gi)o^i<fej one can find s k = (sq, . . . , Sk) in time 0(d u k + 
PT(k) + M(p k+1 d) log(p)). Once they are known, one can apply a Sk and cr" 1 in time 
0(kL(k)). 

Thus, we can compute products, inverses, etc, in UJ, for the cost of the corresponding 
operation in Ufc, plus 0(kL(k)). 

6.1. Solving Artin-S dirtier equations 

As a preliminary, given a H Uj, we discuss how to solve the Artin-Schreier equation 
X p — X = a in Ui. We assume that Tr v ./ ¥p (a) = 0, so this equation has solutions in Uj. 

Because X p — X is F p -linear, the equation can be directly solved by linear algebra, 
but this is too costly. In (Q), Couveignes gives a solution adapted to our setting, that 
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reduces the problem to solving Artin-Schreier equations in Uo- Given a solution <5 G Uj 
of the equation X p — X = a, he observes that any solution ll of 

X p "' * - X = r), with 77 = T pi -i d (a). (8) 

is of the form ll = S — A with A <G Uj_i, hence A is a root of 

X p - X - a + fiP - ll. (9) 

This equation has solutions in Ui_i by hypothesis and hence it can be solved recursively. 
First, however, we tackle the problem of finding a solution of (8). 

For this purpose, observe that the left hand side of (8) is Ui_i-linear and its matrix 
on the basis (1, . . . , x\ _1 ) is 

oG)/3 l _ 1 , P .- lrf ...(V)Ct P .- d " 



Then, algorithm ApproximateAS finds the required solution. 



ApproximateAS 

Input 77 H Uj such that (8) has a solution. 
Output [i H Uj solution of (8). 

(1) let r?o + 77i£i H h T) V -2X V ~ = Push-down(77) 

(2) forje [p-l,...,l], 

(3) return Lift-up(/xia;i + . . . + /i p -ixf _1 ) 



Theorem 22. Algorithm ApproximateAS is correct and takes time 0{L{i)). 

Proof. Correctness is clear from Gaussian elimination. For the cost analysis, remark that 
Pi-i iP i-id has already been precomputed to permit iterated Frobcnius and pseudotrace 
computations. Step 2 takes 0{p 2 ) additions and scalar operations in Uj_i; the overall 
cost is dominated by that of the push-down and lift- up by assumptions on L. □ 

Writing the recursive algorithm is now straightforward. To solve Artin-Schreier equa- 
tions in Uo, we use a naive algorithm based on linear algebra, written NaiveSolve. 
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Artin-Schreier 



Input a,i such that aHUj and Tr Ui( / Fp (a) = 0. 
Output S H Ui such that S p - 5 = a. 

(1) if i = 0, return NaiveSolve(X p - X - a) 

(2) let 77 = Pseudotrace(a, i, i — 1) 

(3) let fi = ApproximateAS(?7) 

(4) let ao = Push-down(a — + fi) 

(5) let A = Artin-Schreier(ao, i — 1) 

(6) return (it + Lift-up(A) 



Theorem 23. Algorithm Artin-Schreier is correct and takes time 0(d u + PT(i)). 

Proof. Correctness follows from the previous discussion. For the complexity, note AS(i) 
the cost for a H Uj. The cost AS(0) of the naive algorithm is 0(M(d) log(p) + d u ), where 
the first term is the cost of computing Xq and the second one the cost of linear algebra. 

When i ^ 1, step 2 has cost PT(z), steps 3, 4 and 6 all contribute 0(L(i)) and step 
5 contributes AS(i — 1). The most important contribution is at step 2, hence AS(i) = 

AS(i - 1) + 0(PT(i)). The assumptions on L imply that the sum PT(1) H h PT(i) is 

0(PT(i)). □ 

6.2. Applying the isomorphism 

We get back to the isomorphism question. We assume that S; = (sq, ■ ■ ■ , Sj) is known 
and we give the cost of applying er Si and its inverse. We first discuss the forward direction. 

As input, v G U' ; is written on the multivariate basis of U' ; ; the output is t = 
c Si (v) H Uj. As before, the algorithm is recursive: we write v = 'Ej <p Vj(x' , . . . , a^__ 1 )3^"' , 
whence 

the sum is computed by Horner's scheme. To speed-up the computation, it is better to 
perform the latter step in a bivariate basis, that is, through a push-down and a lift-up. 

Given t H Uj, to compute v = o~~ (t), we run the previous algorithm backward. We 
first push-down t, obtaining t = to + ■ — h with all tj H Uj_i. Next, we rewrite 

this as t = t' Q + ■ ■ ■ + t' p _iS?~ , with all t' rj H Uj_i, and it suffices to apply cr^ 1 (or 
equivalently cr"^) to all t^. The non-trivial part is the computation of the t^: this is 

done by applying the algorithm FindParameterization mentioned in Subsection 2.3, in the 
extension Uj = Uj_i[Xj]/Pj. 
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Apply Isomorphism 



Input v, i with »£U| written on the basis B^. 
Output a Si (v) H Ui. 

(1) if i = then return u 

(2) write v = ^j <v Vj{x' ,.. . , x'^x'/ 

(3) let Si.o + ■ ■ ■ + Si, p -ix\~ = Push-down(si) 

(4) for j G [0, . . . ,p — 1] let tj = Applylsomorphism(uj, i — 1) 

(5) let t = 

(6) for j G [p - 1, . . . , 0] let t = (si,o H h Si,p-ia;? _1 )£ + tj 

(7) return Lift-up(t) 



Applylnverse 

Input i, i with i H Uj. 

Output (J s T 1 (f) G Uj written on the basis 

(1) if i = then return t 

(2) let t H htp-ixf -1 = Push-down(t) 

(3) let Si,o + ■ ■ ■ + Si, p -ia;f _1 = Push-down(si) 

(4) let t' +- ■ ■+t' p _ 1 X p - 1 = FindParameterization(i H htp-isf" 1 , Si,oH hs liP _ia;f ~ x ) 

(5) return Ej< p Applylnverse(fj, i — l)x'i 



Proposition 24. Algorithms Apply Isomorphism and Applylnverse are correct and both 
take time 0{iL{i)). 

Proof. In both cases, correctness is clear, since the algorithms translate the former dis- 
cussion. As to complexity, in both cases, we do p recursive calls, O(l) push-downs and lift- 
ups, and a few extra operations: for Applylsomorphism, these are p multiplications / addi- 
tions in the bivariate basis of Section 4; for Applylnverse, this is calling the algorithm 
FindParameterization of Subsection 2.3. The costs are 0(pM(p l d)) and 0(p 2 M(p l ~ 1 d)), 
which are in 0(L(i)) by assumption on L. We conclude as in Theorem 17. □ 

6.3. Proof of Theorem 21 

Finally, assuming that only (sn, . . . , Sj-i) are known, we describe how to determine Sj. 
Several choices are possible: the only constraint is that Sj should be a root of Xf — Xi — 
o-sM-i) = Xf ~ Xt - a^M-i) ^ U*. 

Using Proposition 24, we can compute a = <J Si _ 1 (^' i _ 1 ) H U;_i in time 0((i — l)L(i — 
1)) C 0(iL(i)). Applying a lift-up to a, we are then in the conditions of Theorem 23, so 
we can find s, for an extra 0(d w + PT(i)) operations. 

We can then summarize the cost of all precomputations: to the cost of determining Sj, 
we add the costs related to the tower (Un, • • ■ , Ui), given in Sections 3, 4 and 5. After a few 
simplifications, we obtain the upper bound 0(d u + PT(i) + M(p l+1 d) log(p)). Summing 
over i gives the first claim of the theorem. The second is a restatement of Proposition 24. 

7. Experimental results 

We describe here the implementation of our algorithms and an application coming 
from elliptic curve cryptology, isogeny computation. 
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Fig. 1. An example of conversion from the univariate basis to a mixed multivariate basis. 

Implementation. We packaged the algorithms of this paper in a C++ library called 
FAAST and made it available under the terms of the GNU GPL software license from 
http : //www. lix. polytechnique . fr/Labo/Luca.De-Feo/FAAST/ 

FAAST is implemented on top of the NTL library (|29t ) which provides the basic univariate 
polynomial arithmetic needed here. Our library handles three NTL classes of finite fields: 
GF2 for p = 2, zz_p for word-size p and ZZ_p for arbitrary p; this choice is made by the 
user at compile-time through the use of C++ templates and the resulting code is thus 
quite efficient. Optionally, NTL can be combined with the gf2x package (jH) for better 
performance in the p = 2 case, as we did in our experiments. 

All the algorithms of Sections 3-5 are faithfully implemented in FAAST. The algorithms 
Applylsomorphism and Applylnverse have slightly different implementations toUnivariate () 
and toBivariate () that allow more flexibility. Instead of being recursive algorithms do- 
ing the change to and from the multivariate basis = {x' e ° ■ ■ ■ a^ e *}, they only imple- 
ment the change to and from the bivariate basis = {xi-i ei ~ 1 x' i e ' } with ^ e^-i < 
p l ~ 1 d and ^ < p. Equivalently, this amounts to switch between the representations 

HD. and H Uj_i [X-]/ (X'? — X[ — 7$_ 1 ). 

The same result as one call to Applylsomorphism or Applylnverse can be obtained by i 
calls to toUnivaraiteO and toBivariate () respectively. However, in the case where 
several generic Artin-Schreier towers, say (Ug, . . . ,U' k ) and (U ', . . . , U^), are built using 
the algorithms of Section 6, this allows to mix the representations by letting the user 
chose to switch to any of the bases {y^° ■ ■ ■ } where yt is either x- or x" . In other words 
this allows the user to zig-zag in the lattice of finite fields as in Figure 1. 

Besides the alg orithms presented in this paper, FAAST also implements some algorithms 
described in (|10l ) for minimal polynomials, evaluation and interpolation, as they are 
required for the isogeny computation algorithm. 

Experimental results. We compare our timings with those obtained in Magma (0) for 
similar questions. All results are obtained on an Intel Xeon E5430 (2.6GHz). 
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Fig. 2. Build time (left) and isomorphism time (right) with respect to tower height. Plot is in 
logarithmic scale. 

The experiments for the FAAST library were only made for the classes GF2 and zz_p. 
The class ZZ_p was left out because all the primes that can be reasonably handled by 
our library fit in one machine-word. In Magma, there exist several ways to build field 
extensions: 

• quo<U I P> builds the quotient of the univariate polynomial ring U by P € U (written 
magma(l) hereafter); 

• ext<k|P> builds the extension of the field k by P € k[X] (written magma(2)); 

• ext<k|p> builds an extension of degree p of k (written magma(3)). 

We made experiments for each of these choices where this makes sense. 

The parameters to our algorithms are (p,d,k). Thus, our experiments describe the 
following situations: 

• Increasing the height k. Here we take p = 2 and d = 1 (that is, Uo = F2); the in- 
coordinate gives the number of levels we construct and the y-coordinatc gives timings 
in seconds, in logarithmic scale. 

This is done in Figure 2. We let the height of the tower increase and we give timings 
for (1) building the tower of Section 3 and (2) computing an isomorphism with a 
random arbitrary tower as in Section 6. In the latter experiment, only the magma(2) 
approach was meaningful for Magma. 

• Increasing the degree d of Uo. Here we take p = 5 and we construct 2 levels; the x- 
coordinate gives the degree d = [Uo : F p ] and the y-coordinate gives timings in seconds. 
This is done in Figure 3 (left). 

• Increasing p. Here we take d — 1 (thus Uo = ¥ p ) and we construct 2 levels; the in- 
coordinate gives the characteristic p and the y-coordinatc gives timings in seconds. 
This is done in Figure 3 (right). 

The timings of our code are significantly better for increasing height or increasing d. 
Not surprisingly, for increasing p, the magma(l) approach performs better than any other: 
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Fig. 3. Build times with respect to d (left) and p (right). 
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apply a' 1 


apply a 


19 


1.061 


0.269 


1.165 


0.038 


0.599 


0.572 


1.152 


20 


2.381 


0.538 


2.554 


0.076 


1.430 


1.146 


2.333 


21 


5.284 


1.083 


5.645 


0.171 


3.331 


2.306 


4.807 


22 


11.747 


2.202 


12.595 


0.430 


7.730 


4.811 


10.051 


23 


26.441 


4.654 


28.641 


0.961 


18.059 


10.240 


21.494 



Table 1. Some timings in seconds for arithmetics in a generic tower built over F2 using GF2. 



the quo operation simply creates a residue class ring, regardless of the (ir)rcducibility of 
the modulus, so the timing for building two levels barely depend on p. Yet, we notice 
that FAAST has reasonable performances for characteristics up to about p = 50. 

In Tables 1 and 2 we provide some comparative timings for the different arithmetic 
operations provided by FAAST. The column "Primitive" gives the time taken to build one 
level of the primitive tower (this includes the precomputation of the data as described 
in Subsection 4.4); the other entries are self-explanatory. Product and inversion are just 
wrappers around NTL routines: in these operations we didn't observe any overhead com- 
pared to the native NTL code. All the operations stay within a factor of 30 of the cost of 
multiplication, which is satisfactory. 

Finally, we mention the cost of precomputation. The precomputation of the images of 
a as explained in Section 6 is quite expensive; most of it is spent computing pscudotraces. 
Indeed it took one week to precompute the data in Figure 2 (right), while all the other 
data can be computed in a few hours. There is still space for some minor improvement 
in FAAST, mainly tweaking recursion thresholds and implementing better algorithms for 
small and moderate input sizes. Still, we think that only a major algorithmic improvement 
could consistently speed up this phase. 



24 



level 


Primitive 


Push-d. 


Lift-up 


Product 


Inverse 


apply a^ 1 


apply a 


18 


9.159 


0.514 


8.278 


0.321 


6.432 


2.379 


6.624 


19 


21.695 


1.130 


20.388 


1.083 
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32.493 


21 


122.252 
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123.369 


5.307 


92.827 


26.437 


76.780 


22 


275.110 


15.832 


279.338 


10.971 


210.680 


47.956 


134.167 



Table 2. Some timings in seconds for arithmetics in a generic tower built over F2 using zz_p. 
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Fig. 4. Timings for the isogeny algorithm. Isogenies of degree increasing degree are computed 
between curves defined over F 2 ioi . 

Isogeny algorithm. An isogeny is a regular map between two elliptic curves S and S" 
that is also a group morphism. In cryptology, isogenies are used in the Schoof-Elkies- 
Atkin point-counting algorithm (0), but also in more recent constructions (27; 32), and 
the fast computation of isogenies remains a difficult challenge. 

Our interest here is Couveignes' isogeny algorithm (0), which computes isogenies of 
degree ~ p k ; the algorithm relies on the interpolation of a rational function at special 
points in an Artin-Schreier tower. The original algorithm in (@) was first implemented 
in (21); Couveignes' later paper described improvements to speed up the computation, 
but as we already mentioned, a key component, fast arithmetic in Artin-Schreier towers, 
was still missing. The recent paper (fioj ) combines this paper's algorithms and other 
improvements to achieve a completely explicit version of (|9|). 

The algorithm is composed of 5 phases: 

(1) Depending on the degree I of the isogeny to be computed, a parameter k is chosen 
such that p k ~ l {p - 1) > U - 2; 

(2) a primitive tower of height ~ k is computed (the precise height depends on $ and 
£", in the example of figure 4 it is always equal to k — 2); 

(3) an Artin-Schreier tower in which the p fc -torsion points of $ are defined is computed 
and an isomorphism is constructed to the primitive tower; 
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512 


0.207 
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59.82 


50.532 


200.980 
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Table 3. Comparative timings for each phase of the isogeny algorithm using GF2. 



(4) an Artin-Schreier tower in which the p fc -torsion points of $' are defined is computed 
and an isomorphism is constructed to the primitive tower; 

(5) a mapping from S\p k \ to S'\p k \ is computed through interpolation; 

(6) all the possible mappings from <S[p k ] to S"[p k ] are computed through modular 
composition until one is found that yields an isogeny. 

We ran experiments for curves defined over the base field F 2 ioi for increasing isogeny 
degree. Figure 4 shows the timings for two implementations of (fioh based on FAAST and 
one implementation of the same algorithm based on the magma(2) approach; remark 
that the time scale is logarithmic. The running time is probabilistic because step 6 stops 
as soon as it has found an isogeny; we plot the average running times with bars around 
them for minimum/maximum times; the distribution is uniform. Note that the plot in the 
original ISSAC '09 version of this paper shows timings that are one order of magnitude 
worse. This was due to a bug that has later been fixed. 

Table 3 shows comparative timings for each phase of the algorithm. The reason why we 
left step 4 out of the table is that it is essentially the same as step 3 and timings are nearly 
identical. Step 6 is asymptotically the most expensive one; it uses some preconditioning 
to speed up each iteration of the loop. From the point of view of this paper, the most 
interesting steps are 2-5 since they are the only ones that make use of the library FAAST. 

For p = 2, it should be noted that Lercier's isogeny algorithm (|2C)I ) has better per- 
formance; for generic, small, p we mention as well a new algorithm by Lercier and Sir- 
vent ([H). See (H3) for further discussions on isogeny computation. 
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